[ET-VK] Quantized Int8 Convolution + Linear #13811
Title says it all!

This PR adds implementations for int8 quantized convolution and linear layers. Convolution is implemented as matrix multiplication under the hood, using the im2col procedure.

For both linear and convolution, two variants are implemented:

1. `q8ta_q8csw`, which quantizes the input tensor and then performs integer accumulation via the int8 dot product extension.
2. `q8csw`, which dequantizes the weight tensor in-shader and performs floating point accumulation.

The second variant provides an alternative path for executing quantized models when the target GPU does not support the int8 dot product extension.

These new ops are tested via the custom op testing + benchmarking framework introduced in the previous diff.

Differential Revision: [D81323424](https://our.internmc.facebook.com/intern/diff/D81323424/)
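For readers unfamiliar with im2col, here is a minimal pure-Python sketch of the idea, not the shader code from this PR. It assumes a single-channel input, stride 1, and no padding; the function names `im2col` and `conv2d_via_im2col` are hypothetical and chosen only for illustration.

```python
def im2col(image, kh, kw):
    """Unfold an H x W image into rows, one flattened kh x kw patch per row.

    Assumes stride 1 and no padding (illustrative simplification).
    """
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append(
                [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            )
    return rows


def conv2d_via_im2col(image, kernel):
    """Compute a 2D convolution (cross-correlation) as a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)  # shape: (out_h * out_w, kh * kw)
    out_w = len(image[0]) - kw + 1
    # Each output element is the dot product of one patch row with the kernel.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Fold the flat result back into an out_h x out_w grid.
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(len(flat_out) // out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_via_im2col(image, kernel))  # [[-4, -4], [-4, -4]]
```

Once the input is unfolded this way, convolution can reuse the same matrix multiplication path as a linear layer, which is presumably why both ops are added together here.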
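The difference between the two variants described above can be sketched on the CPU in plain Python. This is a rough illustration under stated assumptions, not the Vulkan implementation: it assumes symmetric int8 weights with one scale per output channel and a single scale and zero point for the whole input tensor, and every function name below is hypothetical.

```python
def quantize_weights_per_channel(weights):
    """Symmetric int8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def quantize_input(x, scale, zero_point):
    """Quantize a float vector to int8 with one scale and zero point."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in x]


def linear_int_accum(x, x_scale, x_zp, q_w, w_scales):
    """Quantize the input, accumulate in integers, rescale once per output."""
    q_x = quantize_input(x, x_scale, x_zp)
    out = []
    for q_row, w_scale in zip(q_w, w_scales):
        # Integer dot product; a GPU int8 dot product instruction would do this step.
        acc = sum((qx - x_zp) * qw for qx, qw in zip(q_x, q_row))
        out.append(acc * x_scale * w_scale)
    return out


def linear_float_accum(x, q_w, w_scales):
    """Keep the input in float, dequantize each weight, accumulate in float."""
    out = []
    for q_row, w_scale in zip(q_w, w_scales):
        acc = 0.0
        for xv, qw in zip(x, q_row):
            acc += xv * (qw * w_scale)  # dequantize the weight on the fly
        out.append(acc)
    return out


weights = [[0.5, -1.0], [0.25, 0.125]]
x = [1.0, 2.0]
q_w, w_scales = quantize_weights_per_channel(weights)
print(linear_int_accum(x, 2.0 / 127.0, 0, q_w, w_scales))
print(linear_float_accum(x, q_w, w_scales))
```

Both functions approximate the float result `[-1.5, 0.5]` for this input. The integer path adds input quantization error but does its inner loop entirely in integer arithmetic, while the float path needs no special hardware support.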